Introduction This procedure analyzes the RNA-seq data of multiple samples for quality control purpose, using the gene-level read counts. The read counts can be provided as one or multiple integer data matrixes, corresponding to different mapping types, such as unique vs. multiple mapping and sense vs. antisense strands. The following steps will be applied to the read count data:
Transcriptome of mouse brain tumors
RNA-seq data was generated from of 2 types of mouse brain tumors. Some tumors are tested for drug response. Raw data was processed to get gene-level read counts.
The total read count per sample is the sum of sequence reads of one sample mapped to all genes. Inconsistency of total reads between samples might suggest data quality issues and affect downstream analysis. For example, low total read count could be caused by high level of RNA degradation or high rate of sequencing errors. Shapiro-Wilk normality test shows that the total read counts of this data set is normally distributed (p = 0.89). Samples with extreme total read counts comparing to the others:
Figure 1. Distribution of total read counts of all samples. Total read count per sample (millions):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34.72 44.61 50.54 49.28 54.35 64.37
Due to the variability of gene length and expression level, it is expected that a large portion of the sequence reads are mapped to a small number of genes. At the same time, a large portion of the genes will have no or few reads mapped to them, making them unsuitable for statistical analysis. The distribution of read counts across genes and the consistency of such distribution between samples also provide information about RNA-seq data quality. In this data set,
Figure 3. The cumulative read counts of top 10 genes and the percent of total reads by these genes are calculated for each sample. Samples are ordered by the percents from low to high on the x-axis in this figure. The 2 samples with the lowest and the highest cumulative percents are labeled. The cumulative percents of all samples are then compared to each other to identify samples with extremely lower or higher percents than the other samples:
The read-to-gene mapping could be complicated by at least 3 conditions corresponding to 8 mapping types:
This analysis accepts multiple matching matrixes of read counts corresponding to different mapping types. By default, the first matrix corresponds to the most common mapping type and will be used for all the other analyses in this report. In this section, however, read counts of different mapping types are summarized and compared to each other when multiple matrixes are provided.
Figure 4. Total mapped reads of all samples are split by mapping types, such as unique vs. multiple, paired vs. unpaired, and sense vs. antisense. Among the 2 mapping type(s),
When the gene-level read counts of two mapping types are strongly correlated to each other, they can be combined to increase total read counts and hence statistical power of data analysis. Negative or lack of correlation between mapping types might also provide useful information.
By comparing read counts of different mapping types to each other, the ratios can be compared between samples for consistency. A sample might have quality issue if it has reads of a mapping type much less or more than the other samples.
Gene-specific dispersion of RNA-seq read counts is commonly used to evaluate between-sample variance of a data set. Here, the dispersion is estimated by the overall pattern of coefficient of variation (standard deviation devided by average read count) of all genes.
There are many ways to normalize RNA-seq data to remove systematic bias between samples, usually based on the assumption that all the samples have the same global distribution. The following analysis performs several different normalization methods and evaluate their impact on data dispersion:
Check this paper for detailed comparison of normalization methods, and this function for R code of these normalization methods.
| Method | Mean | Change(%) | Corr2Original | Num_Decrease | Num_Increase | Decrease/Increase |
|---|---|---|---|---|---|---|
| Original | 0.94 | 0.0000 | 1.00 | 0 | 0 | NaN |
| Total_Count | 0.92 | -2.2981 | 1.00 | 12800 | 6549 | 1.95 |
| Median | 0.92 | -1.6242 | 1.00 | 11365 | 8222 | 1.38 |
| Quantile_Quantile | 0.89 | -5.1722 | 0.90 | 13254 | 5918 | 2.24 |
| Upper_Quantile | 0.91 | -2.9796 | 1.00 | 14253 | 5266 | 2.71 |
| Trimmed_Mean | 0.94 | 0.1900 | 1.00 | 8850 | 10608 | 0.83 |
| Relative_Log | 0.94 | 0.1900 | 1.00 | 8850 | 10608 | 0.83 |
| DESeq | 0.92 | -2.3596 | 1.00 | 13071 | 6353 | 2.06 |
| FPKM | 0.92 | -2.2981 | 1.00 | 12884 | 6568 | 1.96 |
| TPM | 0.94 | -0.3191 | 1.00 | 10637 | 8799 | 1.21 |
| Loess | 0.87 | -7.0426 | 0.99 | 14856 | 5163 | 2.88 |
| Cyclic_Loess | 0.90 | -3.7413 | 0.99 | 13802 | 6217 | 2.22 |
This section uses read count data to perform several sample-level analyses. The results can be used to identify potential mislabeling, outlier, confounding variable, etc. Read counts normalized by the Loess method are used for all analyses in this section.
When enough genes on X and Y chromosomes have detectable expression, these genes can be used to predict the gender of samples. In case the gender information is already available, the predicted gender can be used to identify potential mislabeling.
Hierarchical clustering groups samples based on their glabal correlation of all genes. It is a iterative procedure that merges two most similar samples/groups at each step.
Principal components analysis
Check out the RoCA home page for more information.
To reproduce this report:
Find the data analysis template you want to use and an example of its pairing YAML file here and download the YAML example to your working directory
To generate a new report using your own input data and parameter, edit the following items in the YAML file:
if (!require(devtools)) { install.packages('devtools'); require(devtools); }
if (!require(RCurl)) { install.packages('RCurl'); require(RCurl); }
if (!require(RoCA)) { install_github('zhezhangsh/RoCAR'); require(RoCA); }
CreateReport(filename.yaml); # filename.yaml is the YAML file you just downloaded and edited for your analysis
If there is no complaint, go to the output folder and open the index.html file to view report.
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] Rnaseq_0.0.0.9000 DEGandMore_0.0.0.9000
## [3] samr_2.0 matrixStats_0.14.2
## [5] impute_1.42.0 DESeq_1.20.0
## [7] lattice_0.20-33 locfit_1.5-9.1
## [9] Biobase_2.28.0 gplots_3.0.1
## [11] fpc_2.1-10 vioplot_0.2
## [13] sm_2.2-5.4 htmlwidgets_0.5
## [15] DT_0.1 yaml_2.1.13
## [17] rmarkdown_0.9.6 knitr_1.13
## [19] edgeR_3.10.2 awsomics_0.0.0.9000
## [21] RoCA_0.0.0.9000 RCurl_1.95-4.8
## [23] bitops_1.0-6 RankProd_2.42.0
## [25] devtools_1.11.1 DESeq2_1.8.1
## [27] RcppArmadillo_0.5.300.4 Rcpp_0.12.4
## [29] GenomicRanges_1.22.4 GenomeInfoDb_1.6.3
## [31] IRanges_2.4.8 S4Vectors_0.8.11
## [33] BiocGenerics_0.16.1 limma_3.26.9
## [35] snow_0.4-1
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-2 httr_1.2.0 prabclus_2.2-6
## [4] tools_3.2.2 R6_2.1.2 KernSmooth_2.23-15
## [7] rpart_4.1-10 Hmisc_3.16-0 DBI_0.3.1
## [10] colorspace_1.2-6 trimcluster_0.1-2 nnet_7.3-10
## [13] withr_1.0.1 gridExtra_2.0.0 curl_0.9.7
## [16] git2r_0.15.0 formatR_1.3 caTools_1.17.1
## [19] diptest_0.75-7 scales_0.2.5 DEoptimR_1.0-3
## [22] mvtnorm_1.0-3 robustbase_0.92-5 genefilter_1.50.0
## [25] stringr_1.0.0 digest_0.6.9 foreign_0.8-66
## [28] XVector_0.10.0 htmltools_0.3.5 highr_0.5.1
## [31] RSQLite_1.0.0 jsonlite_0.9.22 mclust_5.0.2
## [34] BiocParallel_1.2.20 gtools_3.5.0 acepack_1.3-3.3
## [37] magrittr_1.5 modeltools_0.2-21 Formula_1.2-1
## [40] futile.logger_1.4.1 munsell_0.4.2 proto_0.3-10
## [43] stringi_1.0-1 MASS_7.3-43 zlibbioc_1.14.0
## [46] flexmix_2.3-13 plyr_1.8.3 grid_3.2.2
## [49] gdata_2.17.0 splines_3.2.2 annotate_1.46.1
## [52] geneplotter_1.46.0 reshape2_1.4.1 futile.options_1.0.0
## [55] XML_3.98-1.3 evaluate_0.9 latticeExtra_0.6-26
## [58] lambda.r_1.1.7 gtable_0.1.2 kernlab_0.9-22
## [61] ggplot2_1.0.1 xtable_1.7-4 e1071_1.6-7
## [64] class_7.3-13 survival_2.38-3 AnnotationDbi_1.30.1
## [67] memoise_1.0.0 cluster_2.0.3
END OF DOCUMENT